A DNF Blocking Scheme Learner for Heterogeneous Datasets
Abstract
Entity Resolution is the problem of identifying co-referent entity pairs across datasets. A typical workflow comprises two steps. In the first step, a blocking method uses a one-to-many function, called a blocking scheme, to map entities to blocks. In the second step, entities sharing a block are paired and compared. Current DNF blocking scheme learners (DNF-BSLs) apply only to structurally homogeneous tables. We present an unsupervised algorithmic pipeline for learning DNF blocking schemes on RDF graph datasets as well as structurally heterogeneous tables. Previous DNF-BSLs are admitted as special cases. We evaluate the pipeline on six real-world dataset pairs. The unsupervised results are competitive with supervised and semi-supervised baselines. To the best of our knowledge, this is the first unsupervised DNF-BSL that admits RDF graphs and structurally heterogeneous tables as inputs.
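To make the two-step workflow concrete, the following is a minimal Python sketch of how a hand-written DNF blocking scheme could be applied to two small, structurally heterogeneous record sets. It is not the learner described in the abstract: the scheme is fixed by hand rather than learned, and every name in it (`tokens`, `first_char`, `block_keys`, `candidate_pairs`, the field names, and the sample records) is a hypothetical choice made for illustration.

```python
from collections import defaultdict
from itertools import product


def tokens(value):
    """Blocking function: the set of lowercase whitespace-separated tokens."""
    return set(value.lower().split())


def first_char(value):
    """Blocking function: the lowercase first character, if any."""
    return {value[0].lower()} if value else set()


def block_keys(record, scheme, side):
    """Map one record to a set of block keys (a one-to-many mapping).

    `scheme` is a disjunction (outer list) of conjunctions (inner lists) of
    predicates (function, field in dataset A, field in dataset B).
    `side` is 0 for dataset A and 1 for dataset B.
    """
    keys = set()
    for i, conjunction in enumerate(scheme):
        per_predicate = [
            fn(record.get((field_a, field_b)[side], ""))
            for fn, field_a, field_b in conjunction
        ]
        # A conjunction contributes keys only if every predicate yields one.
        for combo in product(*per_predicate):
            keys.add((i, combo))
    return keys


def candidate_pairs(dataset_a, dataset_b, scheme):
    """Step 1: index dataset A by block key. Step 2: pair within blocks."""
    blocks = defaultdict(set)
    for id_a, record in dataset_a.items():
        for key in block_keys(record, scheme, 0):
            blocks[key].add(id_a)
    pairs = set()
    for id_b, record in dataset_b.items():
        for key in block_keys(record, scheme, 1):
            for id_a in blocks.get(key, ()):
                pairs.add((id_a, id_b))
    return pairs


# Hypothetical sample data: the two datasets use different field names.
dataset_a = {
    "a1": {"name": "Alan Turing", "city": "London"},
    "a2": {"name": "Grace Hopper", "city": "New York"},
}
dataset_b = {
    "b1": {"label": "Turing Alan M.", "location": "London"},
    "b2": {"label": "Ada Lovelace", "location": "London"},
}

# (shared name token AND same city initial) OR (same name initial)
scheme = [
    [(tokens, "name", "label"), (first_char, "city", "location")],
    [(first_char, "name", "label")],
]

print(sorted(candidate_pairs(dataset_a, dataset_b, scheme)))
```

Only the pairs that share at least one block key reach the comparison step, so the second step never has to enumerate all cross-dataset pairs. A scheme that is too permissive lets non-matching pairs through, which is why choosing the scheme well matters.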
Similar resources
Adaptive Candidate Generation for Scalable Edge-discovery Tasks on Data Graphs
Several ‘edge-discovery’ applications over graph-based data models are known to have worst-case quadratic complexity, even if the discovered edges are sparse. One example is the generic link discovery problem between two graphs, which has invited research interest in several communities. Specific versions of this problem include link prediction in social networks, ontology alignment between met...
N-Way Heterogeneous Blocking
Record linkage concerns the linkage of records between two tabular datasets. To avoid naive quadratic computation, typical solutions employ a technique called blocking. A blocking scheme partitions records into blocks, and generates a candidate set by pairing records within a block. Current models of blocking have been restricted to two homogeneous datasets. The variety aspect of Big Data motiv...
A two-step blocking scheme learner for scalable link discovery
A two-step procedure for learning a link-discovery blocking scheme is presented. Link discovery is the problem of linking entities between two or more datasets. Identifying owl:sameAs links is an important special case. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n^2) comparisons by clustering entities into blocks, and limiting the evaluation of l...
An Efficient Membership-query Algorithm for Learning DNF with Respect to the Uniform Distribution
We present a membership-query algorithm for efficiently learning DNF with respect to the uniform distribution. In fact, the algorithm properly learns with respect to uniform the class TOP of Boolean functions expressed as a majority vote over parity functions. We also describe extensions of this algorithm for learning DNF over certain nonuniform distributions and for learning a class of geometric...
An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution
We present a membership-query algorithm for efficiently learning DNF with respect to the uniform distribution. In fact, the algorithm properly learns the more general class of functions that are computable as a majority of polynomially-many parity functions. We also describe extensions of this algorithm for learning DNF over certain nonuniform distributions and from noisy examples as well as f...
Journal:
- CoRR
Volume: abs/1501.01694; Issue: -
Pages: -
Publication date: 2015